Generalizable clinical note section identification with large language models

Abstract Objectives Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as named entity recognition. However, the traditional supervised methods suffer from transferability issues. This study proposes a new framework for using large language models (LLMs) for section identification to overcome the limitations. Materials and Methods We framed section identification as question-answering and provided the section definitions in free-text. We evaluated multiple LLMs off-the-shelf without any training. We also fine-tune our LLMs to investigate how the size and the specificity of the fine-tuning dataset impacts model performance. Results GPT4 achieved the highest F1 score of 0.77. The best open-source model (Tulu2-70b) achieved 0.64 and is on par with GPT3.5 (ChatGPT). GPT4 is also found to obtain F1 scores greater than 0.9 for 9 out of the 27 (33%) section types and greater than 0.8 for 15 out of 27 (56%) section types. For our fine-tuned models, we found they plateaued with an increasing size of the general domain dataset. We also found that adding a reasonable amount of section identification examples is beneficial. Discussion These results indicate that GPT4 is nearly production-ready for section identification, and seemingly contains both knowledge of note structure and the ability to follow complex instructions, and the best current open-source LLM is catching up. Conclusion Our study shows that LLMs are promising for generalizable clinical note section identification. They have the potential to be further improved by adding section identification examples to the fine-tuning dataset.


Introduction
Clinical notes are a rich component of electronic health records (EHR) and natural language processing (NLP) methods can help extract useful information from notes and assist clinical reasoning and knowledge finding. 1,2Clinical notes are organized by sections and automatically identifying sections can benefit downstream tasks such as named entity recognition, 3 cohort discovery, 4 and symptom negation. 5Lei et al 3 found named entity models that included section information achieved the highest accuracy.Edinger et al 4 found section information increased the overall F1 score for cohort retrieval.In addition, identifying the social history section can be helpful for the extraction of social determinants of health. 6Sometimes identifying sections can be conducted manually by humans, but it becomes infeasible for large-scale clinical note analysis that involves thousands of patients and clinical notes, 7,8 making it necessary to create a machine learning model to automate the process.
0][11] They achieve F1 scores as high as 0.90 in the source healthcare institution where they were annotated but usually drop significantly (as low as 0.6 F1) when transferred to another healthcare institution.Part of this could be due to the note-taking differences between healthcare institutions.In addition, these models are highly restricted by the annotation schema, making it difficult to apply trained models to different note types or the same note type but written with a different structure.For example, a model trained on discharge summaries would have difficulty in applying to progress notes, because some section types (eg, "overnight progress") are usually not found in discharge summaries.Zhou et al 12 approached this problem with more traditional supervised learning in a domain adaptation setting, but was forced to map from fine-grained to coarse-grained SOAP sections.The large language model (LLM)-based approach in this work allows us to address the same problem but without having to simplify the output space (see Figure 1 for an example).Lastly, per-section type accuracy could be dependent on the number of annotations available. 12For section types that are scarce, supervised learning models tend to have a lower accuracy due to insufficient annotated examples.
Recent studies that applied LLMs such as ChatGPT 13 and GPT4 14 to healthcare domains achieved promising results. 15LMs use little or no annotated data and function in a question-answering way.It answers after receiving a humanwritten instruction, which can be rewritten or revised for different scenarios.However, like other deep learning models, LLMs lack interpretability, 15 which makes their application in high-stakes domains like medicine riskier.

Objectives
In this study, we experiment with applying LLMs to section identification with 2 goals.First, we measure state-of-the-art LLMs performance for section identification with different accessibility (open-sourced and closed-sourced) and with varying parameter sizes.Second, we seek to better understand LLMs by analyzing their behavior on section identification, which is a comprehensive task that involves clinical knowledge understanding, verbatim copying, label assignment, and output formatting.

Dataset
We refer to the evaluation dataset we use in this study as discharge, which was created in a prior study, 11 and consists of discharge summaries from the Partners HealthCare and Beth Israel Deaconess Medical Center and was originally released through the i2b2 challenge.The dataset consists of 92 notes with 29 section types, which we map to 27 section types to remove redundant categories (see Supplementary Material).We selected 50 notes for the test set, ensuring that, when combined with the prompt, each input to the LLM is at most around 3500 tokens, so there is enough room for the model (context window at 4096 tokens) to generate responses.In other words, we selected 50 notes from discharge specifically for a length threshold so that they would fit into the input token windows for each model, leaving sufficient space for output tokens.This helps for making comparisons between the LLMs-although the version of GPT4 we use allows for a context window of 8k tokens, the other models are limited to 4k tokens.The rest of the notes are used for development.
We experimented with fine-tuning LLMs using ORCA, 16,17 an open-source general-domain instruction tuning dataset that contains question-answer pairs collected from ChatGPT/GPT4 using self-instruct.Self-instruct 18 is the technique composing questions from existing annotated datasets and using them to query ChatGPT/GPT4 to create high-quality answers.The ORCA dataset used in our study contains 0.5 million question-answer pairs created using GPT4.We also experimented with a domain-specific fine-tuning dataset.We used progress, a dataset from a prior study, 12,19,20 which consists of 765 progress notes with 17 section types, and we reduced to 15 section types by mapping several categories to "Unknown."(see Table S2).The dataset is from Beth Israel Deaconess Medical Center and is released as a part of MIMIC-III.We converted the progress notes into question-answer pairs for instruction tuning as described in future sections.The discharge dataset reported an inter-rater agreement F1 score of above 0.9.The progress dataset does not report an inter-rater agreement on the section labels and it was only used during fine-tuning and not for testing.

Off-the-shelf model
We evaluated GPT3.5 (ChatGPT) and GPT4, 13,14 which are the state-of-the-art closed-source models.We also evaluated stateof-the-art open-source models with 13 billion or 70 billion parameters, Llama2-13b-chat, 21 Vicuna-13b 22 (v1.5),Llama2-70b-chat, 21 and Tulu2-70b 23 (with direct policy optimization 24 applied).The Llama2 models are developed by Meta. 21Vicuna-13b is developed by researchers in UC-Berkeley LMSYS group and is a state-of-the-art 13 billion model.Tulu2-70b is developed by Allen Institute for AI and is reported to be competitive against ChatGPT. 23We selected these models because they were all based on Llama2, making them easy to access and compare.We are also aware that the field of LLMs is moving rapidly and these open-source models were only state-of-the-art at the time when the study was conducted, more advanced models like Llama3 25 and Mistral 26 were released later and we leave them to future studies.
All open-source models are tuned on the top of Llama2-13b-base or Llama2-70-base. 21Llama2-13b-chat and Llama2-70b-chat are instruction tuned followed by reinforcement learning-based techniques for aligning with human preferences (ie, reinforcement learning from human feedback-RLHF).Vicuna-13b is trained on human-ChatGPT conversations users shared online. 27Tulu2-70b is trained with a large and diverse instruction dataset followed by direct policy optimization (DPO), a technique with a similar objective as RLHF but removing the need of doing reinforcement learning. 23PT3.5 has 4096 token context window size (maximum input token limit).GPT4 allows 8192 tokens but we limited it to 4096 for fair comparison.All the Llama2 based models (Llama2-13b-chat, Vicuna-13b, Llama2-70b-chat, and Tulu2-70b) have 4096 token context window size.For GPT3.5 and GPT4, we are using them in a HIPAA-compliant environment provided through the Microsoft Azure cloud computing platform.Both GPT3.5 and GPT4 are accessed using the API version "2023-03-15-preview."The open-source models have a context window size of 4096 tokens and their querying is handled by FastChat, 28 which is an open-source LLM hosting platform deployable locally that allows querying LLM in a way similar to GPT3.5/GPT4 by internally handling each model's special input format.

Customized model
To better understand factors that impact LLM performance on section identification, we also train our own LLM on top of Llama2-13b-base.Similar to how Vicuna-13b is trained on Llama2-13b-base, we perform instruction tuning on Llama2-13b-base using the ORCA and/or progress dataset.We also try continued instruction tuning on top of Vicuna-13b.Due to hardware limitations, we employed LoRA 29 for model training.LoRA is a parameter-efficient training technique that tunes only a small number of model parameters.When tuning, we used a batch size of 16, epoch size of 1, learning rate of 0.0004, rank of 16, alpha of 16, dropout of 0.05, and LoRA modules of gate, down and up, following the set up in Lee et al 30 When training our customized model, we also vary the size and ratio of the ORCA and progress dataset.We train with ORCA, with its size varying from 25k, 50k, 100k, 250k to 500k.We also train with progress, with its size varying from 25, 50, 100, 250 to 500, and they are combined with an additional 25k ORCA examples.In preliminary experiments, we found that without any general-domain data (eg, ORCA), the training was less stable and the model seemed to overfit to the progress dataset's section types.We also tried continual instruction tuning on top of Vicuna-13b.We tuned it with 500 examples from progress and/or 25k question-answer pairs sampled from ORCA.Table 1 summarizes the customized models trained in this study.All the customized models are fine-tuned based on Llama2 so they all have 4096 token context window size.

Prompt design
The model prompt is a central part to evaluating LLMs.In this study, we designed the prompt following common practices in LLM querying as well as bypassing some known LLM limitations to make it suitable for applying to section identification.Figure 2 shows the prompt template used in this study.
Our prompt contains 4 components: system message, task description, section definitions, and clinical note.An example of the prompt, as well as the example output, is provided in Supplementary Material.The system message is a high-level instruction to LLMs about how it should behave.In our case, the system message is provided as "You are a helpful assistant.You are an experienced clinician and you are familiar with writing and understanding clinical notes." Then, we provide the task description (section identification) to the LLM-"A clinical note contains multiple sections like family history, allergies and history of present illness.Given a clinical note as an input, please separate the notes into sections and output their section names.Also specify where the section starts and ends.Use section names from one of the following:" Following the task description, we provide section definitions.They contain the section name, the definition, and example section headings.Since the discharge and progress datasets did not provide them, we derived them by inspecting the clinical notes from the non-test set.An example definition for the section type "Admit date" is "Admit date: the date of the patient's admit, example section headings include 'admission date', 'date of admission'."We created a section type "Unknown" for section types that could cause confusion to the model, such as "Subsection."We also modified some of the originally provided section labels to make them more understandable, such as changing "Physical" to "Physical examination."The section mappings and definitions are included in the Supplementary Material.Following the section definitions, we provide 2 example output sections to show the LLM how to format the extracted sections.We found in preliminary experiments that they are useful for models to produce consistent and desired output formats.We instruct the LLM to format sections to begin with a section index and a section name, then the beginning text of the section, and the end text of the section.Ideally, the LLM should output the start and end character index for the section directly, but LLMs do not natively reason in character offset space and have been shown to perform poorly at generating such outputs. 31,32Alternatively, LLMs are good at verbatim copying of text. 33,34Following the section definitions, we provide the LLM with a clinical note followed by an "Output:" prompt which signals the LLM to output the response.

Output postprocessing
Even with the output format included in the prompt, our models occasionally generated unstructured outputs during development.To avoid these, we employ a mechanism that retries the generation when it does not give the desired output format at the minimum; in other words, retry when the extracted section output is not like "Section [number]: [some text] Starts at: [some text] Ends at: [some text]." When the format is correct, there are also content errors such as invalid section names and section start/end text.With this observation, we design an output postprocessing algorithm to convert the output to section names and character spans for evaluating against the gold annotations.Assuming that, when given a clinical note, the LLM extracts a list of sections in the predefined format, and each section contains 3 pieces: section name, section start text, and section end text, we take the following steps.
1) Remove sections that do not have a valid section name.
A valid section name is defined to be exactly matching one of the section names as described in the section description part of the prompt.2) Remove sections that do not have a valid section start text.A valid section start text is defined to be having an exact match in the clinical note.3) For the remaining sections, use the section start text to locate the section's beginning character index.4) Sort sections in the order of the section beginning character index.5) Define a (tentative) section end character index as the next section's beginning character index-1.6) Find a section's end text in between the section's beginning and end character index.If a match is found, use that as the section's new end character index.
An advantage of this postprocessing method is it can still locate a section even when its section end text is incorrectly generated by the LLM, as long as the section start text is valid.Sometimes, LLMs make errors in verbatim copying the section name and section start/end text.We experiment with 2 variations of the above algorithm to mitigate them.
Section name ignore case match: In step 1, when we are matching section names, instead of exact match, we use case insensitive match when finding valid section names.
Fuzzy section start/end match: In steps 2 and 6, when we are matching section start/end text within the clinical note, we use fuzzy match instead of exact match.The fuzzy match lists all possible sequences in the clinical note and finds the sequence most similar to the start/end text.The similarity is scored based on Levenshtein distance, 35 which measures the similarity between 2 sequences as the length-normalized minimum number of single-character edits between them, and it ranges from 0 to 100.We define that a match occurs if the similarity score is greater than or equal to 90.

Evaluation
Due to budget limitations, we evaluate GPT4 and GPT3.5 for 3 repetitions and report the mean and standard deviation.For the open-source and customized models, we repeat the evaluation 5 times.We report micro-F1 scores instead of macro-F1 because section types are not evenly distributed.
Traditional supervised learning section identification methods used BIO tagging for evaluation. 11Limited by the computational complexity, they made sentence level prediction and evaluated based on that.However, LLMs generate characters one by one and do not have this concern.It is more suitable to evaluate with a span-based approach.Inspired by Beeferman et al, 36 which uses an asymmetric function for evaluating section boundaries, and nervaluate 37 on evaluating named entity recognition, we proposed a span-based evaluation method for section identification.This method considers both section span coverage and section label correctness.
We define an algorithm, Evaluate (L1, L2), for calculating L1's accuracy and match ratio against L2.L1 and L2 are both a list of sections with a span indicating the start and end character index of the section, and a section type.The algorithm is shown below.L1 can be either the predicted sections or the target sections.R is a list of section match ratio (r) for the matched L1 sections.N correct is the number of sections in L1 with a correct match in L2.Note that r is calculated for every section in L1.This helps us understand the model's span finding capability directly.

Off-the-shelf model
Table 2 shows the results of each system on the discharge dataset.GPT4 has the highest performance (F1 ¼ 0.77) while GPT-3.5 and Tulu2-70b score similarly with F1 ¼ 0.64.All other models score are not competitive.In terms of the stability over runs, the GPT models are most stable, while opensource models have lower stability.
We performed ablation studies by sequentially removing the 2 postprocessing enhancement techniques: fuzzy start/end text match and case-insensitive section name match.We found this had very little impact to GPT3.5 and GPT4 but enormous impact for the open-source models.Removing fuzzy start/end text match led to a substantial F1 score decrease (0.25) for Tulu2-70b as well as higher variations across runs, making it no longer competitive with GPT3.5.Other open-source models had similar large performance decreases.The F1 decreases to as low as 0.02 for Llama2-70b-chat.
Removing case-insensitive section name match further has the biggest impact on Llama-13b-chat, decreasing the F1 score from 0.12 to 0.02.Vicuna-13b's F1 score decreases by 0.06.Little impact was observed for the 70b models.

Customized model
Table 3 (top row) shows F1 scores for the customized LLMs instruction tuned on top of Llama2-13b-base using the ORCA and progress datasets.We found that F1 improves from 0.291 to 0.38 when changing the number of ORCA dataset size from 25k to 50k.The F1 remains stable with additional instances.
In Table 3 (bottom row), when adding domain-specific instructions to the dataset, adding 25 section identification examples from progress in addition to the 25k ORCA samples increased the F1 score slightly from 0.29 to 0.31.Also, increasing the number of instances to 50 and 100 further improved the F1 to around 0.35.Increasing the size of the examples to 100, 250, and 500 resulted in a F1 drop until 0.3.
In experiments where we tried to continue instruction tune on top of Vicuna-13b, a model already instruction tuned on top of Llama2-13b-base, we found that both continuous instruction tuning methods (Vicuna p-500 and Vicuna o-25k, p-500 ) resulted in a performance drop.

Analysis of GPT4 results
We take an in-depth look into GPT4's behavior to gain more insight about LLMs in section identification.We randomly selected one run for analysis.Table 4 shows the F1 scores by section types.There are 27 section types in total and we observed that 9 (33%) have an F1 score greater than 0.9, and
save S2 0 : the S2 that overlaps most with S1 by section span 6.
if S1 and S2 0 have the same section type: 9.
N correct ¼ N correct þ 1 10.accuracy ¼ N correct /L1 size 11.match ratio ¼ the average of R JAMIA Open, 2024, Vol. 7, No. 3 15 (56%) have an F1 score greater than 0.8.These are often common and important section types such as "Social history" and "Past medical history." Nevertheless, we found GPT4 still scores below or equal to 0.1 F1 for 2 section types (7%).The lowest performing section type for GPT4 is "Unknown" with F1 being 0. There are 4 predictions and 61 gold annotations.When matching the predictions against gold annotations, only 1 section was matched with correct section name.When matching the gold annotations against predictions, no section was matched with correct section names.This might be due to the different interpretations of "Unknown."The gold annotations tend to make the header of the note as "Unknown" sections, while GPT4 tends to label it for sections that have contents that it Another low-performance section type is "Admitting physician" (F1 ¼ 0.1).GPT4 made 19 predictions for "Admitting physician" and the gold annotations only have one instance annotated.We find that GPT4 tended to consider the sections that start with "Dictated by"/"Dictating for" and contains a physician name as the "Admitting physician" section (Figure 3).Those sections are also usually ignored in the gold annotations, resulting in the F1 prediction and gold annotation mismatch that we observed.Additional qualitative assessment shows that GPT4 might be better than what the F1 score suggests.Sometimes GPT4's misclassification comes from it interpreting section definition differently from the gold annotation.The gold annotation tends to annotate the whole note, and GPT4 tends to only annotate the parts that it is sure about.For example, on the left of Figure 4 is the gold annotation, and it annotates 2 "Discharge date" sections, in contrast to 1 section from GPT4 but essentially the same information.Both sections from the gold annotations are wider than what is annotated by GPT4, but it is difficult to say that GPT4 is wrong, and it depends on the use case.
When GPT4 does not match the gold annotations, it is sometimes an annotation error.In Figure 5, the gold annotation misannotated the "Discharge medication" as "Admission medication," while GPT4 annotated it correctly.
Traditional supervised learning models' accuracy for each section type tends to be higher if more annotations are available, and lower when fewer annotations are available. 12For GPT4, we found that the model does not have a clear association with a section's annotation counts (Figure 6).Nevertheless, it is worth noticing that for discharge, the inter-rater agreement was reported to be above 0.9 F1 score, 11 meaning that there is still a gap for LLMs to catch up.

Open-source models are closing the performance gap
All open-source models experienced a large performance drop when removing the fuzzy section start/end match in postprocessing; however, it is worth noticing that Vicuna-13b has the least performance drop across the open-source models.Vicuna-13b differs from the other 3 in that it only performs instruction tuning, without following RLHF or DPO.This might indicate that a second tuning stage based on human preference has the potential of reducing a model's verbatim copying ability.An explanation might be the so-called "Alignment tax," 38 where fine-tuned LLMs lose performance on language modeling tasks due to aligning with human preferences for better performance in chat applications.When removing the case-insensitive section name match further so that the models are evaluated without any postprocessing, both 13b models' performance continue to decrease, but not  for the 70b models.When comparing Tulu2-70b to the 13b models, this might indicate that larger models are better at copying the provided section names while keeping the original upper and lower cases.
Since the release of ChatGPT in the end of 2023, the LLM research community has been working towards reproducing it with open-source efforts, and the competitive performance of Tulu-2-70b against GPT3.5 in section identification is encouraging.However, we do notice that this is achieved with noticeable postprocessing efforts.Both accuracy and stability of Tulu-2-70b decreased without postprocessing while GPT3.5 and GPT4 still perform consistently.

Comparing between open-source models
When comparing between the open-source models, we found the 2 Llama-chat models are consistently underperforming from their same size equivalence.[23]

Efficient training of customized models
When training the customized LLMs with an increasing size of general domain dataset (ORCA), we found the resulted F1 score improves at the beginning, but the gain becomes diminishing afterwards.Alternatively, we experimented with training the LLM with a small amount of section identification examples and found they are effective in improving the F1 score.In the future, when people are trying to train LLMs for section identification, a cost-effective approach might be to train the model with a modest amount of general domain dataset and some section identification examples combined.It is also worth noticing that one should be careful about balancing the ratio of general domain dataset and section identification examples, as our experiments found that adding too many section identification examples can lead to a decreased model performance, likely due to overfitting.
Applying continued instruction tuning on top of already instruction-tuned LLM (Vicuna-13b) led to decreased model performance.This indicates that this approach could be sensitive to overfitting despite the strong initialization point.The performance drop of Vicuna o-25k, p-500 , however, is lower than that of Vicuna p-500 .This might be because Vicuna o-25k, p-500 is trained with an additional 25k general domain instances, alleviating the overfitting.

Limitations
Our study shows the effectiveness of LLM in section identification; however, due to hardware limitation, we only experimented with applying off-the-shelf models on discharge summaries, or training 13b models with parameter efficient training techniques (LoRA 29 ) on progress notes.To better understand LLM in section identification, future studies can consider extending the methodology to other types of clinical notes, such as radiology reports, as well as training larger models and using full fine-tuning.
Second, our study limits models' input token size to 4096 to enable straightforward comparisons between LLMs.GPT4 has a longer input limit and more expensive versions of GPT3.5 supports longer input.0][41] Understanding LLM's section identification performance for longer notes, or multiple notes in concatenation, could be a meaningful area.
Third, this study only considered notes that are within a certain length window and excluded the longer notes.For the longer notes, strategies of handling them include using longer context models like Llama3, 25 employing context window  extension techniques (eg, positional interpolation 42 ) and splitting the notes in to chunks and aggregating predictions, which requires testing of aggregation strategies.The advantages and disadvantages between these techniques can be explored in future studies.
Lastly, a limitation of this study is when evaluating the offthe-shelf and customized models, we were constrained by resources and only evaluated them on discharge, which contains 50 discharge summaries.The zero-shot nature of the experiments performed in this study suggests that the findings have the potential to generalize to other note types such as radiology reports and triage notes, but future work on more datasets and larger test sets is necessary for confirming this.In addition, for fine-tuning the LLMs, we also only used progress for training, and but it would be beneficial to test other note types for fine-tuning.The size of the progress dataset we used for fine-tuning ranged from 25 to 500.Although there are studies suggesting a few hundred notes are sufficient for creating an accurate (eg, F1 ¼ 0.9) supervised learning model for section identification, 9-11 it might be worth investigating using a larger dataset (eg, thousands of notes) for fine-tuning.

Conclusion
We evaluated the use of LLM for section identification under a highly transferrable framework.Our experiments show that GPT4 performed the best and achieved scores of 0.9 F1 or better for one third of the section types on a discharge summary dataset.Experiments suggest that the customized LLMs plateaued with an increasing number of the general domain instructions, and adding a reasonable amount of section identification examples is effective for improving model's performance.

Figure 1 .
Figure 1.A discharge summary with sections identified by GPT4, with date and age censored.

Figure 2 .
Figure2.Overview of the prompt (input) provided to the large language models for section identification.The same prompt is used across all experiments in this study.The section definition of the prompt (blue text) is adjusted based on the dataset (discharge or progress, see TablesS1 and S2for details).

Figure 4 .
Figure 4. Comparing human (left) and GPT4 (right) annotated clinical notes on discharge dates, with date censored.

Figure 3 .
Figure 3.An example of "Admitting physician" annotation difference between the human annotator (left) and GPT4 (right), censoring name and date.

Figure 5 .
Figure 5.An example of the "Discharge medication" section that is annotated wrongly by human but correctly by GPT4.

Figure 6 .
Figure 6.An illustration of non-dependent relationship between section counts (left y-axis) and F1 scores (right y-axis) for GPT4's predictions.The x-axis is the section types.

Table 1 .
Name, base model, and instruction tuning dataset size of the customized models in this study.

Table 2 .
The F1 and match ratio of the open-sourced and closed-sourced models, with sequential ablations of the fuzzy start/end text match and then case insensitive section name match techniques.

Table 3 .
The F1 and match ratio of customized models.

Table 4 .
The detailed performance values of one randomly selected GPT4 run after breaking down into section types, sorted by F1 scores.
could not understand, following what the prompt defines.